The data set is the final data set provided by Udacity for the intro to machine learning course.
It includes a combination of financial data and variables created from email data.
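There are 22 variables for 144 people (126 non-POI and 18 POI). The X column in the summary below is the row label and duplicates name.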
##                   X           bonus         deferral_payments
##  ALLEN PHILLIP K   :  1   Min.   :  70000   Min.   :-102500
##  BADUM JAMES P     :  1   1st Qu.: 425000   1st Qu.:  79644
##  BANNANTINE JAMES M:  1   Median : 750000   Median : 221064
##  BAXTER JOHN C     :  1   Mean   :1201773   Mean   : 841603
##  BAY FRANKLIN R    :  1   3rd Qu.:1200000   3rd Qu.: 867211
##  BAZELIDES PHILIP J:  1   Max.   :8000000   Max.   :6426990
##  (Other)           :138   NA's   :63        NA's   :106
##  deferred_income     director_fees                   email_address
##  Min.   :-3504386   Min.   :  3285                          : 33
##  1st Qu.: -611209   1st Qu.: 83674   a..martin@enron.com    :  1
##  Median : -151927   Median :106164   adam.umanoff@enron.com :  1
##  Mean   : -581050   Mean   : 89823   andrew.fastow@enron.com:  1
##  3rd Qu.:  -37926   3rd Qu.:112815   ben.glisan@enron.com   :  1
##  Max.   :    -833   Max.   :137864   bill.cordes@enron.com  :  1
##  NA's   :96         NA's   :128      (Other)                :106
##  exercised_stock_options    expenses      from_messages
##  Min.   :    3285        Min.   :   148   Min.   :   12.00
##  1st Qu.:  506765        1st Qu.: 22479   1st Qu.:   22.75
##  Median : 1297049        Median : 46548   Median :   41.00
##  Mean   : 2959559        Mean   : 54192   Mean   :  608.79
##  3rd Qu.: 2542813        3rd Qu.: 78408   3rd Qu.:  145.50
##  Max.   :34348384        Max.   :228763   Max.   :14368.00
##  NA's   :43              NA's   :50       NA's   :58
##  from_poi_to_this_person from_this_person_to_poi loan_advances
##  Min.   :  0.00          Min.   :  0.00          Min.   :  400000
##  1st Qu.: 10.00          1st Qu.:  1.00          1st Qu.: 1200000
##  Median : 35.00          Median :  8.00          Median : 2000000
##  Mean   : 64.90          Mean   : 41.23          Mean   :27975000
##  3rd Qu.: 72.25          3rd Qu.: 24.75          3rd Qu.:41762500
##  Max.   :528.00          Max.   :609.00          Max.   :81525000
##  NA's   :58              NA's   :58              NA's   :141
##  long_term_incentive     other             poi      restricted_stock
##  Min.   :  69223     Min.   :       2   False:126   Min.   :-2604490
##  1st Qu.: 275000     1st Qu.:    1203   True : 18   1st Qu.:  252055
##  Median : 422158     Median :   51587               Median :  441096
##  Mean   : 746491     Mean   :  466411               Mean   : 1147424
##  3rd Qu.: 831809     3rd Qu.:  331983               3rd Qu.:  985032
##  Max.   :5145434     Max.   :10359729               Max.   :14761694
##  NA's   :79          NA's   :53                     NA's   :35
##  restricted_stock_deferred     salary        shared_receipt_with_poi
##  Min.   :-1787380          Min.   :    477   Min.   :   2.0
##  1st Qu.: -329825          1st Qu.: 211802   1st Qu.: 249.8
##  Median : -140264          Median : 258741   Median : 740.5
##  Mean   :  621893          Mean   : 284088   Mean   :1176.5
##  3rd Qu.:  -72419          3rd Qu.: 308606   3rd Qu.:1888.2
##  Max.   :15456290          Max.   :1111258   Max.   :5521.0
##  NA's   :127               NA's   :50        NA's   :58
##   to_messages      total_payments      total_stock_value
##  Min.   :   57.0   Min.   :      148   Min.   :  -44093
##  1st Qu.:  541.2   1st Qu.:   396934   1st Qu.:  494136
##  Median : 1211.0   Median :  1101393   Median : 1095040
##  Mean   : 2073.9   Mean   :  2641806   Mean   : 3352073
##  3rd Qu.: 2634.8   3rd Qu.:  2087530   3rd Qu.: 2606763
##  Max.   :15149.0   Max.   :103559793   Max.   :49110078
##  NA's   :58        NA's   :21          NA's   :19
##                  name
##  ALLEN PHILLIP K   :  1
##  BADUM JAMES P     :  1
##  BANNANTINE JAMES M:  1
##  BAXTER JOHN C     :  1
##  BAY FRANKLIN R    :  1
##  BAZELIDES PHILIP J:  1
##  (Other)           :138
The histograms of the financial variables show a variety of distributions in the data set. Many of the variables have missing values, which are dropped before plotting.
None of the variables is normally distributed, and nearly all are skewed.
Data transforms are applied to better represent the distribution of values. Either a log10 or a square-root transform is used, which gives a clearer view of the overdispersed variables.
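The two transforms can be sketched in plain Python as follows. This is only an illustration, not the code behind the plots: the function names are hypothetical, the sample values are the bonus quartiles from the summary above plus one missing entry, and dropping non-positive values before log10 is an assumption, since the text does not say how zeros and negatives were handled.

```python
import math


def log10_transform(values):
    """Return log10 of each strictly positive value.

    Missing (None) and non-positive values are dropped, because log10 is
    undefined for them. Dropping them is an assumption of this sketch.
    """
    return [math.log10(v) for v in values if v is not None and v > 0]


def sqrt_transform(values):
    """Return the square root of each non-negative value, dropping None and negatives."""
    return [math.sqrt(v) for v in values if v is not None and v >= 0]


# Illustrative input: the bonus quartiles from the summary, plus a missing value.
bonus = [70000, 425000, 750000, 1200000, 8000000, None]
log_bonus = log10_transform(bonus)
sqrt_bonus = sqrt_transform(bonus)
```

After the log10 transform the values span roughly two orders of magnitude on a linear axis instead of being dominated by the largest bonus.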
The variables sourced from emails are also highly skewed.
The data transformations give a clearer view of the distribution of values, and several of the variables look closer to a normal distribution after a log10 transformation.
For the financial variables, frequency polygons are used to compare persons of interest (POIs) with everyone else.
For a few of the variables it is difficult to separate any trend between POIs and non-POIs.
Looking at POIs within the email data highlights some promising variables. Shared receipt with POI has a spike for true POIs, but it is overlain by a number of non-POI values as well.
From POI to this person suggests that higher counts may be associated with being a POI.
From this person to POI is harder to interpret, with POIs scattered throughout the distribution.
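A frequency polygon is a histogram drawn as a line through the bin midpoints, one line per group. The sketch below computes the points of such lines for POIs and non-POIs in plain Python. The function names, the bin edges, and the five records are all hypothetical and serve only to show the mechanics.

```python
def frequency_polygon(values, bin_edges):
    """Count values per bin and return (bin midpoints, counts).

    Bins are half-open [lo, hi), except the last bin, which also includes
    its right edge. Missing (None) and out-of-range values are ignored.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        if v is None:
            continue
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    midpoints = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(n_bins)]
    return midpoints, counts


def polygons_by_poi(records, field, bin_edges):
    """Build one frequency polygon per POI flag for the given field."""
    groups = {True: [], False: []}
    for record in records:
        groups[record["poi"]].append(record.get(field))
    return {flag: frequency_polygon(vals, bin_edges) for flag, vals in groups.items()}


# Hypothetical records, not taken from the data set.
records = [
    {"poi": True, "shared_receipt_with_poi": 2000},
    {"poi": True, "shared_receipt_with_poi": 1500},
    {"poi": False, "shared_receipt_with_poi": 300},
    {"poi": False, "shared_receipt_with_poi": 1800},
    {"poi": False, "shared_receipt_with_poi": None},
]
polygons = polygons_by_poi(records, "shared_receipt_with_poi", [0, 1000, 2000, 3000])
```

Because there are far fewer POIs than non-POIs, plotting each group's counts as a share of its own total may make the two lines easier to compare.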
The pair plot gives a way to quickly see any highly correlated variables.
Loan advances is removed because it has too few data points: 141 of its 144 values are missing.
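Screening for variables like this can be sketched in plain Python as follows. The helper name and the threshold of 10 non-missing values are assumptions. The records are placeholders built to match the missing-value counts in the summary (50 for salary, 141 for loan_advances), not the real values.

```python
def sparse_fields(records, fields, min_present):
    """Return the fields that have fewer than min_present non-missing values."""
    sparse = []
    for field in fields:
        present = sum(1 for record in records if record.get(field) is not None)
        if present < min_present:
            sparse.append(field)
    return sparse


# Placeholder records: 144 rows, with 94 salary values and 3 loan_advances values present.
records = [
    {
        "salary": 1.0 if i < 94 else None,
        "loan_advances": 1.0 if i < 3 else None,
    }
    for i in range(144)
]

# Hypothetical threshold: require at least 10 non-missing values.
to_drop = sparse_fields(records, ["salary", "loan_advances"], min_present=10)
```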
The next plot investigates the correlation between to_messages and shared_receipt_with_poi.
The following plot investigates the correlation between from_messages and from_poi_to_this_person.
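One way to put a number on such a relationship is the Pearson correlation coefficient, computed over the rows where both variables are present. The text does not say which correlation measure was used, so Pearson is an assumption here. The function name and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation over pairs where neither value is missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: the third person has no to_messages count and is skipped.
to_messages = [100, 200, None, 400]
shared_receipt_with_poi = [50, 100, 150, 200]
r = pearson_r(to_messages, shared_receipt_with_poi)
```

With skewed variables like these, a few very large values can dominate the coefficient, so it may be worth computing it on the transformed values as well.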
The plot uses ratios of email variables to highlight the persons of interest and how they differ from non-persons of interest.
The plot uses the ratios of total payments and bonus to salary to try to separate out POIs.
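The ratio features can be sketched in plain Python as follows. The text does not say exactly which email ratios were plotted, so the two email ratios here (emails sent to POIs as a share of all sent, and emails received from POIs as a share of all received) are an assumption. The new field names and the example person are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if either is missing or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def add_ratios(record):
    """Return a copy of the record with email and payment ratio fields added."""
    out = dict(record)
    # Assumed email ratios: share of sent mail going to POIs, share of received mail coming from POIs.
    out["to_poi_ratio"] = safe_ratio(record.get("from_this_person_to_poi"), record.get("from_messages"))
    out["from_poi_ratio"] = safe_ratio(record.get("from_poi_to_this_person"), record.get("to_messages"))
    # Payment ratios against salary, as described in the text.
    out["bonus_to_salary"] = safe_ratio(record.get("bonus"), record.get("salary"))
    out["total_payments_to_salary"] = safe_ratio(record.get("total_payments"), record.get("salary"))
    return out


# Hypothetical person, not taken from the data set.
person = {
    "from_messages": 100,
    "from_this_person_to_poi": 25,
    "to_messages": 200,
    "from_poi_to_this_person": 50,
    "salary": 250000,
    "bonus": 500000,
    "total_payments": 1000000,
}
with_ratios = add_ratios(person)
```

Returning None when an input is missing keeps people without email or salary data out of the ratio plots rather than placing them at zero.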